Chicago Crime Theft Data Analysis

This is our analysis of Chicago crime data from 2001 to 2017.

Thanks to these projects:

Importing Data

ERROR

Trying to read the first file with

Crime_2001_to_2004 = pd.read_csv("../Data/Crimes in Chicago_An extensive dataset of crimes in Chicago (2001-2017), by City of Chicago/Chicago_Crimes_2001_to_2004.csv")

raises this error:

ParserError: Error tokenizing data. C error: Expected 23 fields in line 1513591, saw 24

The fix is to skip malformed lines by passing error_bad_lines=False:

Crime_2001_to_2004 = pd.read_csv("../Data/Crimes in Chicago_An extensive dataset of crimes in Chicago (2001-2017), by City of Chicago/Chicago_Crimes_2001_to_2004.csv", error_bad_lines=False)

Now the file loads. pandas reports the skipped line and a dtype warning:

b'Skipping line 1513591: expected 23 fields, saw 24\n'
/home/hanpeng/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3165: DtypeWarning: Columns (17,20) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,

The reason:

When pandas reads a file and hits a line whose number of fields does not match the header row, it raises an error. If the offending line can be ignored, add this parameter so the line is skipped.
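One caveat for newer environments: `error_bad_lines` was deprecated in pandas 1.3 and removed in pandas 2.0, where the equivalent option is `on_bad_lines="skip"`. A minimal sketch on an in-memory CSV, where the sample rows are made up for illustration and are not from the Chicago dataset:

```python
import io

import pandas as pd

# Made-up sample: the header has 3 fields, but the second data row has 4.
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# on_bad_lines="skip" (pandas >= 1.3) drops rows with too many fields
# instead of raising ParserError.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df)
```

Only the two well-formed rows survive, which matches the behaviour of the older flag.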

Data Combination

Combine the data from the four files into one DataFrame.

Call gc.collect() afterwards to free RAM.
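A minimal sketch of this step, assuming the four files have already been loaded as DataFrames. The tiny frames and the names `parts` and `crimes` are made up for illustration:

```python
import gc

import pandas as pd

# Tiny stand-ins for the four yearly CSV files (made-up rows).
parts = [
    pd.DataFrame({"ID": [1, 2], "Year": [2001, 2003]}),
    pd.DataFrame({"ID": [3], "Year": [2006]}),
    pd.DataFrame({"ID": [4], "Year": [2010]}),
    pd.DataFrame({"ID": [5, 6], "Year": [2013, 2016]}),
]

# Stack the pieces vertically and rebuild a clean 0..n-1 index.
crimes = pd.concat(parts, ignore_index=True)

# Drop the references to the pieces, then ask the garbage collector to run.
del parts
gc.collect()
```

`gc.collect()` only reclaims objects that are no longer referenced, so the `del` before it is what actually makes the memory eligible to be freed.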

Basic Info

Dropping Duplicated ones

What makes two cases duplicated?

That looks convincing, so let's drop duplicates on that basis:
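The notebook's actual duplicate criterion is not shown here. As a sketch only, assuming two rows count as duplicates when they share a "Case Number" (a hypothetical choice of key), `drop_duplicates` with `subset` does the job:

```python
import pandas as pd

# Made-up rows; the first two share a case number.
cases = pd.DataFrame({
    "ID": [10, 11, 12],
    "Case Number": ["HX100", "HX100", "HX200"],
    "Primary Type": ["THEFT", "THEFT", "BURGLARY"],
})

# Assumption: rows sharing a "Case Number" describe the same case.
# keep="first" retains the earliest row of each duplicate group.
deduped = cases.drop_duplicates(subset=["Case Number"], keep="first")
print(deduped)
```

Swapping the `subset` list for whichever columns define a duplicate in the real data is the only change needed.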

Data Selection

Three types of crime fit the definition of stealing: THEFT, MOTOR VEHICLE THEFT, and BURGLARY.

So we keep only the rows for those three types and discard the rest.

Call gc.collect() again to keep RAM usage low.
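A minimal sketch of the selection, assuming the crime category lives in a column called "Primary Type" (the column name and the sample rows are assumptions for illustration):

```python
import pandas as pd

# Made-up sample with a mix of crime categories.
crimes = pd.DataFrame({
    "Primary Type": [
        "THEFT", "BATTERY", "BURGLARY", "MOTOR VEHICLE THEFT", "NARCOTICS",
    ],
})

# The three categories treated as stealing in the text.
theft_types = ["THEFT", "MOTOR VEHICLE THEFT", "BURGLARY"]

# isin() builds a boolean mask; only matching rows are kept.
thefts = crimes[crimes["Primary Type"].isin(theft_types)].reset_index(drop=True)
print(thefts)
```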

Handling NaN and Missing Value

Turning it into a DataFrame

There are missing values and null values:

So we need to deal with them
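How the missing values were handled is not shown here. One common option, sketched below as an assumption rather than the notebook's actual choice, is to count them per column and then drop the incomplete rows. The sample frame is made up:

```python
import pandas as pd

# Made-up sample with one missing block and one missing latitude.
thefts = pd.DataFrame({
    "Block": ["001XX N STATE ST", None, "002XX W LAKE ST"],
    "Latitude": [41.88, 41.89, float("nan")],
})

# Count missing entries per column (None and NaN are both counted).
missing_per_column = thefts.isnull().sum()
print(missing_per_column)

# Assumption: rows with any missing field are simply dropped.
cleaned = thefts.dropna()
```

Dropping is the bluntest option. If too many rows disappear, restricting `dropna` with its `subset` parameter to the columns the analysis needs is a gentler alternative.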

Date Processing

Deleting 2017

Something is off in the results for 2017:

The reason is that the data stops on January 18, 2017:

So for convenience, we simply delete the 2017 data.

Deleting 2004 and Before

Something is also wrong with the data in the spring of 2004 and before the middle of 2002.

So we decided to delete everything from 2004 and earlier.

Remember to free RAM afterwards.
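Both deletions (2017, and 2004 and earlier) amount to keeping the years 2005 through 2016. A minimal sketch, assuming the timestamps sit in a datetime column called "Date" (the column name and rows are made up for illustration):

```python
import pandas as pd

# Made-up timestamps spanning the years discussed above.
thefts = pd.DataFrame({
    "Date": pd.to_datetime([
        "2002-03-10 14:00",
        "2004-04-02 09:30",
        "2005-01-01 00:05",
        "2016-12-31 23:10",
        "2017-01-18 12:30",
    ]),
})

# between() is inclusive on both ends by default, so this keeps 2005..2016.
keep = thefts["Date"].dt.year.between(2005, 2016)
thefts = thefts[keep].reset_index(drop=True)
print(thefts)
```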

Info Once Again

Data Analysis and Visualization

Percentages

Plotting this takes too much time and RAM, so I'll leave out the plotting part.

dt.floor('d') rounds each timestamp down to the start of its day, dropping the time of day.

There seems to be an annual pattern.
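A small sketch of what flooring to the day does and how it leads to daily counts. The timestamps are made up, and the uppercase "D" alias is used because recent pandas versions prefer it over lowercase 'd':

```python
import pandas as pd

# Made-up timestamps: two on March 1st, one on March 2nd.
stamps = pd.Series(pd.to_datetime([
    "2010-03-01 00:15",
    "2010-03-01 18:40",
    "2010-03-02 09:05",
]))

# Round every timestamp down to midnight of the same day.
days = stamps.dt.floor("D")

# Count how many events fall on each day, in chronological order.
daily_counts = days.value_counts().sort_index()
print(daily_counts)
```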

Also

There seems to be no monthly pattern.

Also

Location Analysis

What it looks like in bar graphs:

So, on average, there are:

for each block

Paint the Map with Crime Data

Red means dangerous, green means safe

This makes sure the bad blocks are plotted last: later markers are drawn on top of earlier ones, so the dangerous blocks stay visible.

Calculate suitable vmin and vmax values for the color scale with a method based on the interquartile range (IQR).
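The exact rule used in the notebook is not shown. A common IQR-based choice, assumed here, is Tukey's fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR), clipped to the observed range. The counts below are made up:

```python
import pandas as pd

# Made-up crime counts per block, with one extreme outlier.
block_counts = pd.Series([1, 2, 2, 3, 3, 4, 5, 6, 8, 120])

q1 = block_counts.quantile(0.25)
q3 = block_counts.quantile(0.75)
iqr = q3 - q1

# Assumption: Tukey's fences, clipped so the bounds stay inside the data range.
vmin = max(block_counts.min(), q1 - 1.5 * iqr)
vmax = min(block_counts.max(), q3 + 1.5 * iqr)
print(vmin, vmax)
```

With bounds like these, a single extreme block no longer stretches the color scale so far that every other block looks equally safe.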

Limiting the Folium Map to Chicago

Block Crimes on Map with Radius and Colors

Put the block crime counts onto the map as circles whose radius reflects the count.

To get a good radius, extreme values need to be handled.

Radius 0 stands for no crime. Counts between lower and upper should then be mapped to radii from 1 (or another starting point) to 100 (or another upper limit).

So we need a function that passes through (x = lower, y = 1) and (x = upper, y = 100 or another target value).

The function converts a crimeCount into a radius value.

We can get the a and b values of y = ax + b with basic algebra.
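Solving the two point conditions gives a = (100 - 1) / (upper - lower) and b = 1 - a * lower. A sketch of such a function, where the name `make_radius_fn`, its parameters, and the clipping of extreme counts to [lower, upper] are assumptions for illustration:

```python
def make_radius_fn(lower, upper, r_min=1.0, r_max=100.0):
    """Build y = a*x + b passing through (lower, r_min) and (upper, r_max)."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    a = (r_max - r_min) / (upper - lower)
    b = r_min - a * lower

    def radius(crime_count):
        # Radius 0 is reserved for blocks with no crime.
        if crime_count <= 0:
            return 0.0
        # Assumption: extreme values are clipped to [lower, upper] first.
        clipped = min(max(crime_count, lower), upper)
        return a * clipped + b

    return radius


# Hypothetical bounds; in practice they would come from the data.
to_radius = make_radius_fn(lower=1, upper=11)
print(to_radius(0), to_radius(1), to_radius(6), to_radius(500))
```

Clipping before the linear map means a block far above `upper` gets the maximum radius instead of a circle that swallows the map.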

blockCrimeCount.max()

blockCrimeCount.min()

blockCrimeCount.mean()

blockCrimeCount.std()

Test

Prediction

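Here's an intro to Prophet:

Model prediction with Prophet: the Prophet algorithm provided by Facebook can handle time series that contain some outliers as well as some missing values, and it can forecast the future trend of a time series almost fully automatically. What Prophet does is:

take the timestamps and corresponding values of a known time series as input;

take the length of the period to be forecast as input;

output the future trend of the time series.

The output provides the necessary statistics, including the fitted curve and its upper and lower bounds.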

Learning to Use the Model

If you want to add a new time period

User Predict

If you want to add a new time to predict
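Prophet's `predict` takes a DataFrame with a datetime column named `ds`, so predicting for custom dates comes down to building such a frame. The sketch below builds only the frame with pandas. The dates are hypothetical, and the commented-out lines assume a fitted Prophet model named `model`, which is not defined here:

```python
import pandas as pd

# Hypothetical dates we want a forecast for (one week, daily).
new_dates = pd.DataFrame({
    "ds": pd.date_range(start="2017-01-01", periods=7, freq="D"),
})
print(new_dates)

# With a fitted Prophet model (assumed to be called `model`):
# forecast = model.predict(new_dates)
# forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```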

Manually Evaluate Test

Mean Absolute Error (L1)

Mean Squared Error (L2)

R² (coefficient of determination)
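For reference, the three metrics can be computed by hand as below. This is a plain-Python sketch with made-up numbers, not the notebook's evaluation code. Library implementations such as those in scikit-learn would normally be used instead:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference (L1)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference (L2)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 18.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r2_score(actual, predicted)
print(mae, mse, r2)
```

MAE and MSE are in the units of the data (and its square), while R² is unitless, with 1 meaning a perfect fit and 0 meaning no better than predicting the mean.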

Saving and Opening the Model

According to https://facebook.github.io/prophet/docs/additional_topics.html#updating-fitted-models